Learning from Data Streams

Authors

  • João Gama
  • Pedro Pereira Rodrigues
Abstract

In the last two decades, machine learning research and practice have focused on batch learning, usually with small datasets. In batch learning, the whole training data is available to the algorithm, which outputs a decision model after processing the data possibly (or most of the time) multiple times. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Also, most learners use a greedy, hill-climbing search in the space of models. What distinguishes current datasets from earlier ones is the continuous flow of data and the automatic data feeds. We do not just have people entering information into a computer. Instead, we have computers entering data into each other. Nowadays there are applications in which the data is best modelled not as persistent tables but rather as transient data streams. In some applications it is not feasible to load the arriving data into a traditional Database Management System (DBMS), and traditional DBMSs are not designed to directly support the continuous queries required in these applications (Babcock et al., 2002). These sources of data are called data streams. There is a fundamental difference between learning from small datasets and learning from large datasets. As pointed out by some researchers (Brain & Webb, 2002), current learning algorithms emphasize variance reduction. However, learning from large datasets may be more effective when using algorithms that place greater emphasis on bias management. Algorithms that process data streams deliver approximate solutions, providing a fast answer using few memory resources. They relax the requirement of an exact answer to an approximate answer within a small error range with high probability. In general, as the error range decreases, the computational resources required go up. In some applications, mostly database oriented, an approximate answer should be within an admissible error margin. Some results on tail inequalities provided by statistics are useful to accomplish this goal.
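
As a rough illustration of how a tail inequality can turn "an approximate answer within a small error range with high probability" into a concrete resource budget, the sketch below uses the Hoeffding bound. This choice is an assumption made for illustration only; the abstract does not name a specific inequality. For independent observations of a variable with range R, the two-sided bound states that the sample mean differs from the true mean by more than ε with probability at most 2·exp(−2nε²/R²). The names `hoeffding_epsilon`, `required_samples`, and `StreamingMean` are hypothetical helpers, not part of any library.

```python
import math
import random


def hoeffding_epsilon(n, delta, value_range=1.0):
    """Two-sided Hoeffding error bound after n observations.

    With probability at least 1 - delta, the sample mean of n independent
    observations with range `value_range` is within the returned epsilon
    of the true mean.
    """
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def required_samples(epsilon, delta, value_range=1.0):
    """Smallest n for which the bound guarantees an error of at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


class StreamingMean:
    """Running mean over a stream, using constant memory (illustrative)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: no past observations are stored.
        self.n += 1
        self.mean += (x - self.mean) / self.n


if __name__ == "__main__":
    random.seed(0)
    estimator = StreamingMean()
    for _ in range(10_000):
        estimator.update(random.random())  # simulated stream of values in [0, 1]

    eps = hoeffding_epsilon(estimator.n, delta=0.05)
    print(f"estimated mean {estimator.mean:.4f}, error bound {eps:.4f} (95% confidence)")

    # Tightening the error range increases the number of examples needed.
    for target in (0.1, 0.01):
        print(f"epsilon={target}: n >= {required_samples(target, delta=0.05)}")
```

Under this bound the required number of observations grows with 1/ε², which is one way to read the statement that resources go up as the error range decreases.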

Similar resources

Active Learning from Stream Data

In this paper, we propose a new research problem on active learning from data streams where data volumes grow continuously. The objective is to label a small portion of stream data from which a model is derived to predict future instances as accurately as possible. We propose a classifier-ensemble based active learning framework which selectively labels instances from data streams to build an e...

Learning from Data Streams with Concept Drift

Increasing access to incredibly large, nonstationary datasets and corresponding demands to analyse these data have led to the development of new online algorithms for performing machine learning on data streams. An important feature of real-world data streams is "concept drift," whereby the distributions underlying the data can change arbitrarily over time. The presence of concept drift in a d...

Multimodal Data Collection and Analysis of Collaborative Learning through an Intelligent Tutoring System

A great deal of learning analytics research has focused on what can be achieved by analyzing log data, which can yield important insights about how students learn in online systems. Log data cannot capture all important learning phenomena, especially in open-ended, collaborative, or project-based environments. Collecting and processing/analyzing additional multimodal data streams, however, pres...

Enabling Lazy Learning for Uncertain Data Streams

The lazy learning concept is realized by the k-nearest neighbor algorithm, which is used for classification and, similarly, for clustering; both are based on Euclidean distance. Lazy learning is more advantageous for complex and dynamic learning on data streams. However, the lazy learning process consumes a lot of memory and has low prediction efficiency. This process is...

TECNO-STREAMS: Tracking Evolving Clusters in Noisy Data Streams with a Scalable Immune System Learning Model

Artificial Immune System (AIS) models hold many promises in the field of unsupervised learning. However, existing models are not scalable, which makes them of limited use in data mining. We propose a new AIS based clustering approach (TECNO-STREAMS) that addresses the weaknesses of current AIS models. Compared to existing AIS based techniques, our approach exhibits superior learning abilities, ...

Cost Sensitive Online Multiple Kernel Classification

Learning from data streams has been an important open research problem in the era of big data analytics. This paper investigates supervised machine learning techniques for mining data streams with application to online anomaly detection. Unlike conventional machine learning tasks, machine learning from data streams for online anomaly detection has several challenges: (i) data arriving sequentia...


Publication date: 2009